
[None][fix] Fix _waiting_requests to use compute tokens with KV cache reuse #12521

Open
lancelly wants to merge 8 commits into NVIDIA:main from lancelly:fix_delay_batching

Conversation

@lancelly
Collaborator

@lancelly lancelly commented Mar 25, 2026

Description

_waiting_requests() in py_executor.py uses get_tokens(0) to compute the scheduled context token count for the batch_wait decision. get_tokens(0) returns the full input token sequence (e.g., ~34K tokens for agentic workloads), whose length always exceeds the batch_wait_max_tokens_ratio * max_num_tokens threshold. This makes batch_wait_timeout_iters a complete no-op when KV cache block reuse is enabled.

Fix: Subtract estimated_reusable_tokens (already set by the capacity scheduler's radix tree lookup) to get the actual compute tokens, matching how the micro batch scheduler
itself accounts for cache reuse.

Before (always False → never waits):

```python
num_scheduled_ctx_tokens = sum(
    len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
# → 34742 tokens >> 4096 threshold → should_waiting = False
```

After (correctly reflects compute cost):

```python
num_scheduled_ctx_tokens = 0
for ctx_req in context_requests:
    req_tokens = len(ctx_req.get_tokens(0))
    reusable = ctx_req.estimated_reusable_tokens if ctx_req.is_first_context_chunk else 0
    num_scheduled_ctx_tokens += max(1, req_tokens - reusable)
# → 1106 compute tokens < 4096 threshold → should_waiting = True
```
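The accounting above can be sketched as a standalone function. This is a minimal sketch: the `CtxRequest` class below is a stand-in that mirrors only the request fields named in the description (it is not the actual py_executor.py code), and the `estimated_reusable_tokens` value is chosen to be consistent with the 34742 → 1106 example above.

```python
# Sketch of the per-request compute-token accounting from the PR description.
# CtxRequest is a hypothetical stand-in for the real request object.
from dataclasses import dataclass
from typing import List


@dataclass
class CtxRequest:
    tokens: List[int]
    estimated_reusable_tokens: int = 0
    is_first_context_chunk: bool = True

    def get_tokens(self, beam: int) -> List[int]:
        return self.tokens


def compute_scheduled_ctx_tokens(context_requests) -> int:
    """Sum per-request compute tokens, discounting reusable KV cache blocks."""
    total = 0
    for ctx_req in context_requests:
        req_tokens = len(ctx_req.get_tokens(0))
        # Reuse only applies to the first context chunk of a request.
        reusable = (ctx_req.estimated_reusable_tokens
                    if ctx_req.is_first_context_chunk else 0)
        # Clamp so every scheduled request contributes at least 1 token.
        total += max(1, req_tokens - reusable)
    return total


# Example consistent with the description: ~34K input tokens, most reusable.
req = CtxRequest(tokens=list(range(34742)), estimated_reusable_tokens=33636)
print(compute_scheduled_ctx_tokens([req]))  # → 1106
```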


## Summary by CodeRabbit

* **Bug Fixes**
  * Enhanced token accounting in the request scheduler to accurately reflect KV cache reuse during context processing.
* **Tests**
  * Added test coverage for request waiting behavior with various KV cache reuse scenarios.


Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
@lancelly lancelly requested a review from a team as a code owner March 25, 2026 03:17
@lancelly lancelly requested a review from achartier March 25, 2026 03:17
@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

📝 Walkthrough

Updated token accounting in _waiting_requests() method to subtract reusable KV cache tokens for context requests and clamp per-request effective token contribution to a minimum of 1, with corresponding test coverage for KV reuse scenarios.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Token Accounting Logic**<br>`tensorrt_llm/_torch/pyexecutor/py_executor.py` | Modified `_waiting_requests()` to account for KV cache reuse: replaced the direct token length summation with an explicit per-request loop that subtracts `estimated_reusable_tokens` for first context chunks and clamps the effective token count to a minimum of 1 per request. |
| **Test Coverage for Waiting Logic**<br>`tests/unittest/_torch/executor/test_py_executor.py` | Added helper function `_make_ctx_request`, mock executor `MockPyExecutorForWaiting`, and test suite `TestWaitingRequests` with four test cases covering: no KV reuse, first context chunk reuse, non-first context chunk behavior, and per-request token clamping. |
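For illustration, the clamping and non-first-chunk cases might be exercised along these lines. This is a hypothetical sketch using `MagicMock`: the real tests use `_make_ctx_request` and `MockPyExecutorForWaiting`, whose exact signatures are not shown in this summary, and `compute_scheduled_ctx_tokens` here is an inline stand-in for the accounting described in the PR body.

```python
# Hypothetical test sketch; not the actual test_py_executor.py code.
import unittest
from unittest.mock import MagicMock


def compute_scheduled_ctx_tokens(context_requests) -> int:
    # Mirrors the accounting described in the PR body.
    total = 0
    for ctx_req in context_requests:
        req_tokens = len(ctx_req.get_tokens(0))
        reusable = (ctx_req.estimated_reusable_tokens
                    if ctx_req.is_first_context_chunk else 0)
        total += max(1, req_tokens - reusable)
    return total


class TestClamping(unittest.TestCase):
    def _make_req(self, n_tokens, reusable, first=True):
        req = MagicMock()
        req.get_tokens.return_value = list(range(n_tokens))
        req.estimated_reusable_tokens = reusable
        req.is_first_context_chunk = first
        return req

    def test_fully_reusable_request_clamps_to_one(self):
        # Even a fully cached request contributes at least 1 token.
        req = self._make_req(100, reusable=100)
        self.assertEqual(compute_scheduled_ctx_tokens([req]), 1)

    def test_non_first_chunk_ignores_reuse(self):
        # Reuse discount applies only to the first context chunk.
        req = self._make_req(100, reusable=100, first=False)
        self.assertEqual(compute_scheduled_ctx_tokens([req]), 100)


if __name__ == "__main__":
    unittest.main()
```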

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: fixing `_waiting_requests` to account for KV cache reuse when computing scheduled context tokens. |
| Description check | ✅ Passed | The PR description explains the problem, root cause, and solution clearly. However, the Test Coverage section required by the template is missing. |
| Docstring coverage | ✅ Passed | Docstring coverage is 88.89%, which is sufficient. The required threshold is 80.00%. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unittest/_torch/executor/test_py_executor.py`:
- Line 223: The long boolean assignment to should_waiting exceeds line-length
limits; split the expression across lines using parentheses and intermediate
variables if helpful. Locate the assignment to should_waiting (uses
self.batch_wait_iters_count, self.batch_wait_timeout_iters,
num_scheduled_tokens, self.batch_wait_max_tokens_ratio, self.max_num_tokens) and
reformat it so each comparison and the multiplication are on separate lines (or
extract num_scheduled_tokens < ... into a named variable) to keep each line <120
chars while preserving the same logic and operands.
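One way to apply that suggestion is sketched below. This is illustrative only: the attribute names come from the review comment itself, and the concrete threshold values are made up for the example, not taken from any real configuration.

```python
# Hypothetical reformatting of the long boolean assignment flagged above.
# Attribute names come from the review comment; values are illustrative.
class Sched:
    def __init__(self):
        self.batch_wait_iters_count = 0
        self.batch_wait_timeout_iters = 10
        self.batch_wait_max_tokens_ratio = 0.5
        self.max_num_tokens = 8192

    def should_wait(self, num_scheduled_tokens: int) -> bool:
        # Extract the token comparison into a named variable so each
        # line stays well under the 120-character limit.
        under_token_budget = (
            num_scheduled_tokens
            < self.batch_wait_max_tokens_ratio * self.max_num_tokens
        )
        return (self.batch_wait_iters_count < self.batch_wait_timeout_iters
                and under_token_budget)


s = Sched()
print(s.should_wait(1106))   # → True (under the 4096-token budget)
print(s.should_wait(34742))  # → False
```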

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 29b96125-8bf0-4aad-81ad-0f1ef8ebff36

📥 Commits

Reviewing files that changed from the base of the PR and between b130996 and 2bd5013.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tests/unittest/_torch/executor/test_py_executor.py

Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
@lancelly
Collaborator Author

/bot run --disable-fail-fast


@tensorrt-cicd
Collaborator

PR_Github #40247 [ run ] triggered by Bot. Commit: fac40e3 Link to invocation

@lancelly lancelly requested a review from SimengLiu-nv March 25, 2026 06:45
lancelly pushed a commit to lancelly/TensorRT-LLM that referenced this pull request Mar 25, 2026
… reuse

Cherry-pick from PR NVIDIA#12521. _waiting_requests() was using full input
sequence length (get_tokens(0)) which always exceeded the batch_wait
threshold when KV cache reuse is enabled. Now subtracts
estimated_reusable_tokens to get actual compute tokens.

Signed-off-by: Lance Liao <laliao@login-preos02.a51.clusters.nvidia.com>
Made-with: Cursor
@tensorrt-cicd
Collaborator

PR_Github #40247 [ run ] completed with state SUCCESS. Commit: fac40e3
/LLM/main/L0_MergeRequest_PR pipeline #31376 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@SimengLiu-nv SimengLiu-nv left a comment

The changes in tensorrt_llm/_torch/pyexecutor/py_executor.py look good to me. Requesting changes in the testing code.

Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40416 [ run ] triggered by Bot. Commit: 4df69cb Link to invocation

Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
@tensorrt-cicd
Collaborator

PR_Github #40416 [ run ] completed with state SUCCESS. Commit: 4df69cb
/LLM/main/L0_MergeRequest_PR pipeline #31509 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast


@tensorrt-cicd
Collaborator

PR_Github #40446 [ run ] triggered by Bot. Commit: a1d6c29 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40447 [ run ] triggered by Bot. Commit: a1d6c29 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40447 [ run ] completed with state SUCCESS. Commit: a1d6c29
/LLM/main/L0_MergeRequest_PR pipeline #31538 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@SimengLiu-nv SimengLiu-nv left a comment

LGTM.

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40475 [ run ] triggered by Bot. Commit: a1d6c29 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40475 [ run ] completed with state SUCCESS. Commit: a1d6c29
/LLM/main/L0_MergeRequest_PR pipeline #31562 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40490 [ run ] triggered by Bot. Commit: a1d6c29 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40490 [ run ] completed with state SUCCESS. Commit: a1d6c29
/LLM/main/L0_MergeRequest_PR pipeline #31578 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40496 [ run ] triggered by Bot. Commit: a1d6c29 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40496 [ run ] completed with state SUCCESS. Commit: a1d6c29
/LLM/main/L0_MergeRequest_PR pipeline #31583 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40524 [ run ] triggered by Bot. Commit: a1d6c29 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40524 [ run ] completed with state SUCCESS. Commit: a1d6c29
/LLM/main/L0_MergeRequest_PR pipeline #31610 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lancelly and others added 2 commits March 28, 2026 02:44
The mock class calls PyExecutor._waiting_requests with self as the
mock instance, but _waiting_requests internally calls
self._compute_scheduled_tokens which is a @staticmethod on PyExecutor.
Without exposing it on the mock, Python's Mock auto-generates a
Mock-returning attribute, causing numeric comparison failures.

Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40540 [ run ] triggered by Bot. Commit: c42d178 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40540 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/28.

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40544 [ run ] triggered by Bot. Commit: c42d178 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40544 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/28.

Link to invocation

